Redcedar Data Analyses Instructions

Please note this analysis and R Markdown document are still in development :)

Approach

The overall approach is to model empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.

Data Wrangling

Import iNat Data - Empirical Tree Points (Response variables)

The steps for wrangling the iNat data are described here.

Format and export for collecting climateNA data

Data were subset to include only GPS information for use in collecting ancillary data.

Remove iNaturalist columns and explanatory variables not needed for random forest models
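A minimal sketch of this export step (the iNat column names below are hypothetical; ClimateNA expects an input CSV with ID1, ID2, lat, long, and el columns):

```r
# Mock of the iNat observations; real data have many more columns
inat <- data.frame(
  id        = c(1, 2),
  latitude  = c(47.61, 48.43),
  longitude = c(-122.33, -123.37),
  elevation = c(50, 120)
)

# Subset to the columns ClimateNA needs, in its required order
gps <- data.frame(
  ID1  = inat$id,
  ID2  = inat$id,       # ClimateNA requires two ID columns
  lat  = inat$latitude,
  long = inat$longitude,
  el   = inat$elevation
)

# Write the file ClimateNA will read (e.g. the gps2566 file)
write.csv(gps, "gps_for_climatena.csv", row.names = FALSE)
```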

Import Normals Data

Climate data were then extracted for the iNat GPS locations using the ClimateNA tool, following the process below.

ClimateNA version 7.42

  • Climate data extraction process with ClimateNA
    • Convert data into format for climateNA use (see above)
    • In ClimateNA
      • Normal Data
        • Select input file (browse to gps2566 file)
        • Choose ‘More Normal Data’
          • Select ‘Normal_1991_2020.nrm’
        • Choose ‘All variables(265)’
        • Specify output file
  • Grouping explored
    • data averaged over 30 year normals (1991-2020)
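After the ClimateNA run, the output can be read back in and the climate columns prefixed so they are recognizable downstream (e.g. norm_CMI). A base-R sketch using mock data in place of the real ClimateNA output file:

```r
# Mock of a ClimateNA output table (the real output has 265 climate columns)
normals <- data.frame(ID1 = c(1, 2), ID2 = c(1, 2),
                      lat = c(47.61, 48.43), long = c(-122.33, -123.37),
                      el = c(50, 120),
                      CMI = c(35.2, 48.1), MAP = c(950, 1210))

# Prefix the climate columns with "norm_" to mark them as 30-year normals
clim.cols <- setdiff(names(normals), c("ID1", "ID2", "lat", "long", "el"))
names(normals)[names(normals) %in% clim.cols] <- paste0("norm_", clim.cols)
```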

Variables

Note that the analysis below uses the iNat data with 1510 observations. Amazing!

  • Response variables included in this analysis
    • Tree canopy symptoms (binary)
  • Explanatory variables included
    • Climate data
      • 30yr normals 1991-2020 (265 variables - annual, seasonal, monthly)

Remove specific climate variables not useful as explanatory variables (e.g. norm_Latitude)
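A short sketch of this step (the dropped column names are hypothetical examples of coordinates echoed back by ClimateNA):

```r
# Mock normals table containing a non-predictor column
normals <- data.frame(norm_Latitude = c(47.6, 48.4),
                      norm_CMI      = c(35.2, 48.1))

# Columns that carry location rather than climate information
drop.vars <- c("norm_Latitude", "norm_Longitude", "norm_Elevation")

# Keep everything except the listed columns
normals <- normals[, !(names(normals) %in% drop.vars), drop = FALSE]
```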

Remove Outliers

For some reason there is one observation with an extremely negative CMI value (roughly -10000).
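A base-R sketch of dropping that outlier (the cutoff of -1000 is illustrative, chosen only to isolate the implausible value):

```r
# Mock data containing one implausibly negative CMI value
normals <- data.frame(norm_CMI = c(35.2, -10234, 48.1))

# Drop rows with CMI below an illustrative cutoff
normals <- normals[normals$norm_CMI > -1000, , drop = FALSE]
```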

Separate climate variable groupings

Normals data for 265 variables were downloaded for each point:

  • Monthly - 180 variables represented data averaged over months for the 30 year period
  • Seasonal - 60 variables represented data averaged over 3 month seasons (4 seasons) for the 30 year period
  • Annual - 20 variables represented data averaged over all years during the 30 year period
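The split into groupings can be sketched by matching ClimateNA's naming conventions, where monthly variables end in a month number (e.g. Tmax01) and seasonal variables end in a season code (_wt, _sp, _sm, _at); the regexes below are an assumption based on those conventions:

```r
# A few mock variable names covering all three groupings
vars <- c("norm_Tmax01", "norm_Tmax07", "norm_CMI_wt", "norm_PPT_sp",
          "norm_MAP", "norm_DD_18")

monthly  <- vars[grepl("(0[1-9]|1[0-2])$", vars)]   # ends in month 01-12
seasonal <- vars[grepl("_(wt|sp|sm|at)$", vars)]    # ends in a season code
annual   <- setdiff(vars, c(monthly, seasonal))     # everything else
```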

Remove variables that have near-zero standard deviations (the entire column is the same value)
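This filter can be done in base R (a sketch with mock data; the object name normals.nearzerovar follows the counts reported below):

```r
# Mock normals with one constant (zero standard deviation) column
normals <- data.frame(norm_CMI   = c(35.2, 48.1, 29.9),
                      norm_PAS07 = c(0, 0, 0))

# Standard deviation of each column
sds <- sapply(normals, sd, na.rm = TRUE)

# Keep only columns whose standard deviation is meaningfully above zero
normals.nearzerovar <- normals[, sds > 1e-8, drop = FALSE]

# Number of columns removed
length(normals) - length(normals.nearzerovar)
```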

Full

Dropping columns with near-zero standard deviation removed `length(normals) - length(normals.nearzerovar)` climate variables.

Monthly

Dropping columns with near-zero standard deviation removed `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly climate variables.

Seasonal

There were `length(normals.seasonal) - length(normals.seasonal.nearzerovar)` seasonal variables with zero standard deviation.

Annual

There were `length(normals.annual) - length(normals.annual.nearzerovar)` annual variables with zero standard deviation.

Remove other explanatory variable categories (binary or five categories)

Compare model errors

Five categorical response

Full Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE,      na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 42.19%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              84     133    28              32           21   0.7181208
## Healthy               66    1201    95              78           32   0.1841033
## Other                 25     170    71              38           12   0.7753165
## Thinning Canopy       33     175    36              88            7   0.7404130
## Tree is Dead          25      50     9              12           32   0.7500000

Monthly Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.monthly, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 42.15%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              83     136    28              31           20   0.7214765
## Healthy               64    1209    90              78           31   0.1786685
## Other                 25     177    69              33           12   0.7816456
## Thinning Canopy       33     181    33              84            8   0.7522124
## Tree is Dead          24      52     9              11           32   0.7500000

Seasonal Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.seasonal, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 43.01%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              82     139    26              32           19   0.7248322
## Healthy               69    1195    96              82           30   0.1881793
## Other                 27     174    66              38           11   0.7911392
## Thinning Canopy       36     177    37              80            9   0.7640118
## Tree is Dead          25      53     7              11           32   0.7500000

Annual Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.annual, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 42.66%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              85     135    26              32           20   0.7147651
## Healthy               68    1193    94              82           35   0.1895380
## Other                 26     178    66              38            8   0.7911392
## Thinning Canopy       32     180    32              87            8   0.7433628
## Tree is Dead          24      51     9              11           33   0.7421875

Binary Normal Model

Full Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 31.49%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1107       365   0.2479620
## Unhealthy     439       642   0.4061055

Monthly Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 31.84%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1099       373   0.2533967
## Unhealthy     440       641   0.4070305

Seasonal Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 32.47%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1094       378   0.2567935
## Unhealthy     451       630   0.4172063

Annual Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 32.47%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1091       381   0.2588315
## Unhealthy     448       633   0.4144311

Summary of model performance

Response   Explanatory Vars   Vars tried per split   OOB Error (%)
5 class    Full               14                     42.19
5 class    Monthly            12                     42.15
5 class    Seasonal           7                      43.01
5 class    Annual             4                      42.66
Binary     Full               14                     31.49
Binary     Monthly            12                     31.84
Binary     Seasonal           7                      32.47
Binary     Annual             4                      32.47

Identify important variables

Binary Response, Annual Explanatory Variable

2001 trees is overkill because the error rate stabilizes after about 800 trees.

Binary Response, Seasonal Explanatory Variable

Clearly all of the climate variables are highly correlated.

Let's pick the top-performing metric in our random forest analyses, CMI, and then any less-correlated variables.

Below we can check the correlation of CMI, MAP, and DD_18
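A base-R sketch of that correlation check (the values below are mock data standing in for the per-tree annual normals):

```r
# Mock annual normals for the three retained climate variables
clim <- data.frame(norm_CMI   = c(35, 48, 30, 55, 41),
                   norm_MAP   = c(950, 1210, 880, 1400, 1020),
                   norm_DD_18 = c(3200, 2900, 3400, 2700, 3100))

# Pairwise Pearson correlations, rounded for readability
round(cor(clim), 2)
```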

Now we can check how the model performs with only these three climate variables

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 32.47%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1091       381   0.2588315
## Unhealthy     448       633   0.4144311

It’s hard to give up the seasonality data, but the seasonal variables are all highly correlated (data not shown), and in the importance plot above for the seasonality data, the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) all had the highest Mean Decrease Accuracy and Mean Decrease Gini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter values for each variable.